As per Monash’s integrity rules, this assignment must be completed independently and must not be shared beyond this class.
🔑 Instructions
This is an open-book exam, and you may use any resources you find helpful, including generative AI.
Write your answers into the solutions part of the provided exam-solution.qmd file, then render it and upload it to your GitHub repo when finished.
Exercises
1. Warm-up (5pts)
The simulated data in c5551.rda has 5 variables. What is its shape? Solid or hollow, sphere, cube, torus, hexagonal prism, ellipsoid, Roman surface, or Möbius strip? Explain your reasoning.
The shapes seen in the projections are circular, sometimes with a hole in the middle. This rules out a sphere, cube or ellipsoid, which would give filled projections. The Roman surface and Möbius strip are only defined in 3D, and a hexagonal prism would not have a hole, which leaves the torus.
It is also not solid, as can be seen when using a slice tour.
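A minimal sketch of how these projections could be produced with the tourr package; the object name `c5551` inside the .rda file is an assumption:

```r
# Sketch only: assumes c5551.rda contains a data frame named `c5551`
library(tourr)
load("c5551.rda")

# Grand tour of all 5 variables: projections appear circular,
# sometimes with a hole, suggesting a torus-like shape
animate_xy(c5551)

# Slice tour: slicing through the centre reveals whether the
# object is solid or hollow
animate_slice(c5551)
```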
2. Dimension reduction (25pts)
The data in feats_all.rda is a collection of 200 time series, common macroeconomic and microeconomic series, extracted from A self-organizing, living library of time-series data. Each series has been converted to a set of time series features using the feasts package. These include measures of trend, seasonality, autocorrelation, jumps and variance. There are 37 variables, all of which are features.
a. (5pts) Using a grand tour on the full set of 37 variables, describe the structure of this data (e.g. outliers, clustering, linear association, non-linear association). Ignore the variable named type for this exercise.
There are 1-2 outliers, strong linear association between some variables, and 3-4 differently shaped clusters.
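A hedged sketch of how the grand tour could be run on the 37 features, assuming the data frame inside feats_all.rda is named `feats_all` and holds the numeric features plus a `type` column:

```r
# Sketch only: object and column names are assumptions
library(tourr)
library(dplyr)
load("feats_all.rda")

feats_num <- feats_all |> select(-type)

# Grand tour on rescaled features; look for outliers, clusters,
# and linear or non-linear association between variables
animate_xy(rescale(feats_num))
```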
b. The plots (Figure 1, Figure 2) and table (Table 1) below summarise the principal component analysis of the data. The data containing the first five principal components is available in the feats_pc_d.rda dataset. Ignore the variables named type, cl2, cl3, cl4, cl5, cl6 for this exercise.
(3pts) Explain why two principal components is not enough to summarise the variability in this data.
(2pts) Why would five principal components be a good choice?
(5pts) Use a grand tour to examine the first five PCs. Describe the structure that is still present in the data when it is reduced from 37 variables to 5 (clustering, outliers, nonlinear association).
(5pts) There is an outlier in PC4. On which of the time series features (trend_strength, …, stat_arch_lm) does this time series have high values? So how would the time series of this point appear (strong trend, seasonality, peaks, spikiness, …) ?
Figure 1: Scree plot of the principal component analysis of the economic time series features data.
Table 1: Coefficients of the first five principal components.
Figure 2: Scatterplot matrix of the first five principal components of the economic time series features data.
Solution
Two principal components are not enough because substantially more variance is explained by the next few components. Also, the structure seen in the full data, the outliers and clustering, cannot be seen in only two PCs.
The first five principal components all explain more variance than would be expected if the data were fully 37-dimensional. Six might also be reasonable, because there is a drop/elbow there, after which the variance explained tapers off slowly.
With five principal components we can still see an outlier and some clustering. The linear dependence is no longer visible, but the non-linear dependence is.
PC4 has large negative coefficients for shift_var_max, shift_level_max and spikiness. Because the outlier is at the bottom (negative) end of PC4, the double negative means it is an outlier because it has high values on these variables. This could be an interesting series: spiky, and perhaps shifting up and down in level.
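The variance and coefficient checks above can be sketched in R; the object name `feats_all` and the use of `prcomp` with scaling are assumptions, since the exam supplies the PCA output rather than the code:

```r
# Sketch only: recompute the PCA and inspect variance and loadings
library(dplyr)
load("feats_all.rda")

pca <- feats_all |> select(-type) |> prcomp(scale. = TRUE)

# Proportion of variance per PC; under fully 37-D data each PC
# would explain roughly 1/37 of the variance
summary(pca)$importance["Proportion of Variance", 1:8]

# PC4 loadings, sorted: large negative coefficients flag the
# features driving the low-end outlier
sort(pca$rotation[, "PC4"])
```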
c. (5pts) If you were to make a 5D model and overlay it on the data in 5D, how well do you anticipate it fits? Good fit, poor fit, with reasons.
Solution
The PCA model is essentially a box, with edges aligned to the principal directions. This data has clusters, outliers and non-linear structure, with very different variance patterns that do not match a box, so the model would be a poor fit.
3. Clustering (25pts)
This question uses the time series features data also. Below is the dendrogram of hierarchical clustering conducted on the first five principal components.
Figure 3: Dendrogram summarising the hierarchical clustering of the first five principal components of the time series features data.
a. (3pts) Based on the dendrogram, how many clusters would you suggest are reasonable to consider? Explain your answer.
Solution
Anywhere from 2 to 10 clusters would be possible. Around 5 clusters might be best: at 5, one point sits in its own cluster, which likely corresponds to one of the big outliers.
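A hedged sketch of the hierarchical clustering; the object name `feats_pc_d`, the PC column names, and Ward linkage are assumptions, as the exam does not state which linkage produced the dendrogram:

```r
# Sketch only: cluster the first five PCs and inspect cluster sizes
library(dplyr)
load("feats_pc_d.rda")

pcs <- feats_pc_d |> select(PC1:PC5)
hc <- hclust(dist(pcs), method = "ward.D2")  # linkage is an assumption
plot(hc)

# Cut at k = 5: check whether one cluster is a singleton outlier
table(cutree(hc, k = 5))
```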
b. (4pts) Cross-tabulate the two-cluster solution cl2 and the original series type variable type. Using this table, and a grand tour of the first five PCs coloured by each of these two results, describe how they are similar or not.
These results are not similar. The clustering divides the data into one small, well-separated cluster and all the remaining points. The type variable gives two very oddly shaped clusters that are not separated.
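A sketch of the cross-tabulation and coloured tours described above, assuming `feats_pc_d` holds the PCs alongside the `cl2` and `type` columns:

```r
# Sketch only: compare cluster labels against series type
library(tourr)
library(dplyr)
load("feats_pc_d.rda")

# Cross-tabulation of the two-cluster solution vs series type
table(feats_pc_d$cl2, feats_pc_d$type)

# Grand tours of the first five PCs, coloured two ways
pcs <- feats_pc_d |> select(PC1:PC5)
animate_xy(pcs, col = feats_pc_d$cl2)
animate_xy(pcs, col = feats_pc_d$type)
```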
c. (8pts) Using the grand tour, and possibly a guided tour, come to a decision about which number of clusters (2, 3, 4, 5, or 6) is the best for this data.